Genetic data and kinfinding (lite)
Paul B. Conn
Euring Technical Meeting, Friday 21 April, 2023
I am not a geneticist!
The proportion of genes inherited from a common ancestor
Different expectations for the proportion of shared alleles that are identical by descent at randomly selected loci as a function of kin type (sexual reproduction, diploid)
| Kin type | \(\kappa_0\) | \(\kappa_1\) | \(\kappa_2\) |
|---|---|---|---|
| Parent-Offspring (PO) | 0 | 1 | 0 |
| Full sibling (FS) | 0.25 | 0.5 | 0.25 |
| Half sibling (HS) | 0.5 | 0.5 | 0 |
| Grandparent-grandchild (GG) | 0.5 | 0.5 | 0 |
| Full aunt-niece (FAN) | 0.5 | 0.5 | 0 |
| Half aunt-niece (HAN) | 0.75 | 0.25 | 0 |
| First cousin | 0.75 | 0.25 | 0 |
| Half first-cousin | 0.875 | 0.125 | 0 |
| Unrelated | 1.0 | 0 | 0 |
\(\kappa_a\) - expected fraction of the genome sharing \(a\) alleles IBD (identical by descent) at a random locus
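As a hedged aside that is not on the original slide, the \(\kappa\) values connect to the slide title through the coefficient of relatedness, the expected proportion of alleles shared IBD. The symbol \(r\) is introduced here only for illustration:

\[ r = \tfrac{1}{2}\kappa_1 + \kappa_2 \]

For example, PO gives \(r = 0.5\), FS gives \(0.25 + 0.25 = 0.5\), HS gives \(0.25\), first cousins give \(0.125\), and unrelated pairs give \(0\). PO and FS share the same \(r\) but have different \(\boldsymbol{\kappa}\), which is why the full \(\boldsymbol{\kappa}\) vector, not \(r\) alone, matters for telling kin types apart.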
Let’s look at the probability of two individuals having particular allele combinations at a specific locus, conditional on an underlying kin relationship. Note that individuals can have the same alleles even if they aren’t IBD.
Animal \(i\): \(G_i\) = AB = BA
Animal \(j\): \(G_j\) = BB
We need to know the frequency of these alleles in the population! Call these \(p_A\) and \(p_B\).
If the animals are unrelated we have
\(Pr (G_i, G_j | \text{unrelated}) = Pr(G_i ) Pr (G_j)\)
\(Pr(G_i) = 1 - p_A^2 - p_B^2 = 2 p_A p_B\) (two alleles, Hardy-Weinberg proportions)
\(Pr(G_j ) = p_B^2\)
So, \(Pr(G_i, G_j| \text{unrelated}) = p_B^2 (1-p_A^2-p_B^2)\)
If the animals are POPs we no longer have independence. But,
\[ Pr(G_i, G_j | POP) = Pr(G_j) Pr(G_i| G_j, POP) = p_B^2 p_A \]
Since one allele must be IBD (here B, because animal \(j\) carries only B), and the other can be assumed to be random according to marginal population-level probabilities.
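The two probabilities above can be checked numerically. The following is a minimal sketch, assuming a biallelic locus in Hardy-Weinberg proportions; the allele frequency `p_a = 0.3` and the function names are hypothetical, chosen only for illustration.

```python
# Worked example: G_i = AB, G_j = BB at one biallelic locus.
# p_a = 0.3 is a hypothetical allele frequency chosen for illustration.

def pr_pair_unrelated(p_a: float) -> float:
    """Pr(G_i = AB, G_j = BB | unrelated) under Hardy-Weinberg proportions."""
    p_b = 1.0 - p_a
    pr_gi = 1.0 - p_a**2 - p_b**2   # heterozygote AB, equal to 2 * p_a * p_b
    pr_gj = p_b**2                  # homozygote BB
    return pr_gi * pr_gj


def pr_pair_pop(p_a: float) -> float:
    """Pr(G_i = AB, G_j = BB | parent-offspring pair)."""
    p_b = 1.0 - p_a
    return p_b**2 * p_a             # Pr(G_j) * Pr(G_i | G_j, POP)


p_a = 0.3
unrelated = pr_pair_unrelated(p_a)
pop = pr_pair_pop(p_a)
ratio = pop / unrelated             # algebraically 1 / (2 * p_b)
print(f"unrelated: {unrelated:.4f}, POP: {pop:.4f}, ratio: {ratio:.4f}")
```

With these assumed frequencies the ratio is below 1, so this single locus on its own slightly favors "unrelated"; one locus says very little, which is why many loci are needed.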
General formulation
\(P(G_i,G_j | \boldsymbol{\kappa}) = \kappa_0 P_0(G_i,G_j) + \kappa_1 P_1(G_i,G_j) + \kappa_2 P_2(G_i,G_j)\)
Note that \(P_0(G_i,G_j)\) is simply the product of marginals, and \(P_2(G_i,G_j)\) is an indicator function. The only nuance is in \(P_1\)!
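A minimal sketch of the general formulation for a single biallelic locus follows. It is written under stated assumptions rather than taken from any package: genotypes are coded as the number of B alleles (0 = AA, 1 = AB, 2 = BB), Hardy-Weinberg proportions hold, and \(P_2\) is implemented as the marginal of \(G_i\) times the indicator that the genotypes match. All function and variable names are hypothetical.

```python
# kappa vectors (kappa_0, kappa_1, kappa_2) for a few kin types from the table
KAPPA = {
    "PO": (0.0, 1.0, 0.0),
    "FS": (0.25, 0.5, 0.25),
    "HS": (0.5, 0.5, 0.0),
    "U": (1.0, 0.0, 0.0),
}


def genotype_probs(p_b: float) -> list:
    """Hardy-Weinberg genotype probabilities for AA, AB, BB."""
    p_a = 1.0 - p_b
    return [p_a**2, 2.0 * p_a * p_b, p_b**2]


def one_ibd_transition(p_b: float) -> list:
    """Pr(G_j | G_i) when exactly one allele is shared IBD.

    Row = G_i, column = G_j. The shared allele is copied from G_i and
    the other allele of G_j is drawn from the population frequencies.
    """
    p_a = 1.0 - p_b
    return [
        [p_a, p_b, 0.0],
        [0.5 * p_a, 0.5, 0.5 * p_b],
        [0.0, p_a, p_b],
    ]


def pair_prob(gi: int, gj: int, p_b: float, kappa: tuple) -> float:
    """P(G_i, G_j | kappa) = k0 * P0 + k1 * P1 + k2 * P2 at one locus."""
    marg = genotype_probs(p_b)
    p0 = marg[gi] * marg[gj]                          # product of marginals
    p1 = marg[gi] * one_ibd_transition(p_b)[gi][gj]   # one allele IBD
    p2 = marg[gi] if gi == gj else 0.0                # both alleles IBD
    k0, k1, k2 = kappa
    return k0 * p0 + k1 * p1 + k2 * p2


# G_i = AB, G_j = BB with a hypothetical p_B = 0.7
print(pair_prob(1, 2, 0.7, KAPPA["PO"]), pair_prob(1, 2, 0.7, KAPPA["U"]))
```

For \(G_i\) = AB and \(G_j\) = BB this reproduces \(p_B^2 p_A\) for PO and \(p_B^2 (1 - p_A^2 - p_B^2)\) for unrelated pairs, matching the earlier worked example.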
The previous slide gives the probability of two genotypes at a given locus conditional on a particular kin relationship
To quantify evidence for a particular relationship, we’ll use likelihood ratios, e.g.,
\[ \frac{L_{PO}(G_i,G_j)}{L_U(G_i,G_j)} = \frac{P(G_i,G_j | \boldsymbol{\kappa}(PO))}{P(G_i,G_j | \boldsymbol{\kappa}(U))} \]
We’ll also want to extend things to multiple loci
\[ \frac{L_{PO}({\bf G}_i,{\bf G}_j)}{L_U({\bf G}_i,{\bf G}_j)} = \frac{P({\bf G}_i,{\bf G}_j | \boldsymbol{\kappa}(PO))}{P({\bf G}_i,{\bf G}_j | \boldsymbol{\kappa}(U))} \]
where e.g.
\[ P({\bf G}_i,{\bf G}_j | \boldsymbol{\kappa}(PO)) = \prod_k P(G_{ik},G_{jk} | \boldsymbol{\kappa}(PO)) \]
Note that taking a product assumes independence across loci, something that will be violated by linkage
More on this in a few slides!
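In practice the product over loci is usually computed as a sum of logs, since a product of thousands of small probabilities underflows. The sketch below assumes the per-locus probabilities under each hypothesis have already been computed and that the denominator is never zero; the function name and the input values are hypothetical.

```python
import math


def log_likelihood_ratio(num_probs, den_probs):
    """Sum of per-locus log ratios, i.e. the log of the product over loci.

    num_probs[k] and den_probs[k] are P(G_ik, G_jk | kappa) at locus k under
    the numerator and denominator hypotheses. Loci are treated as independent.
    """
    total = 0.0
    for num, den in zip(num_probs, den_probs):
        if num == 0.0:
            # A genotype pair that is impossible under the numerator
            # hypothesis (e.g. AA vs BB for PO) excludes it outright.
            return float("-inf")
        total += math.log(num) - math.log(den)
    return total


# Three hypothetical loci, each with the AB/BB configuration and p_B = 0.7
po_probs = [0.147, 0.147, 0.147]
unrelated_probs = [0.2058, 0.2058, 0.2058]
llr = log_likelihood_ratio(po_probs, unrelated_probs)
print(llr)
```

Returning negative infinity on a zero numerator is one design choice; with genotyping error in the model, the numerator would be small rather than exactly zero.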
Here’s what we want our likelihood ratios to look like!
Unfortunately, this is more often the case:
For real problems, we don’t get the colors! However we have an approximate idea of where the means should be (assuming Hardy-Weinberg, no linkage, etc)
Several strategies:
-Use enough loci (and with sufficient heterozygosity) to separate the curves. In our bearded seal study, after QA/QC we had 2,569 loci and still had to deal with overlapping curves!
-Fit a normal mixture model, specify a false-negative threshold that effectively eliminates the probability of false positives, and use the estimated false-negative probability \(\alpha\) in the CKMR likelihood
i.e., replace \(Pr(HSP)\) with \(Pr(HSP)(1-\alpha)\) (stay tuned)
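The threshold idea can be sketched as follows, assuming the HSP component of the fitted mixture is normal on the log-likelihood-ratio scale. The mean, standard deviation, threshold, and `adjusted_hsp_prob` are hypothetical placeholders, not values or functions from the bearded seal study.

```python
from statistics import NormalDist

# Hypothetical fitted values for the HSP component of the log-LR mixture
hsp_mean, hsp_sd = 50.0, 10.0
threshold = 30.0  # pairs with log-LR below this are not called HSPs

# False-negative probability: the share of true HSPs falling below the threshold
alpha = NormalDist(mu=hsp_mean, sigma=hsp_sd).cdf(threshold)


def adjusted_hsp_prob(pr_hsp: float, alpha: float) -> float:
    """Replace Pr(HSP) with Pr(HSP) * (1 - alpha) in the CKMR likelihood."""
    return pr_hsp * (1.0 - alpha)


print(f"alpha = {alpha:.4f}")
```

Here the threshold sits two standard deviations below the HSP mean, so roughly 2% of true half-sibling pairs would be missed and the modeled HSP probability is scaled down accordingly.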
Eric Anderson has a great R package, CKMRsim, for kinfinding. By conducting simulations that “sprinkle” proposed loci into a similar genome, one can also see what linkage and genotyping errors will tend to do to potential variances.
Using simulations based on the number of loci and approximate genomic structure of bearded seals:
-Actual bearded seal data kin-finding used M. Bravington’s and S. Baylis’ kinference package
-Strange shape (left-skewed) - we estimated variance with the right half-normal and considered different log-odds thresholds and associated probabilities in CKMR sensitivity runs. Hopefully the problem would go away with more data!
But how do we go from tissue samples to alleles at different loci??
Much more thorough (and accurate) summary at https://eriqande.github.io/tws-ckmr-2022/slides/eric-talk-1.html
A non-exhaustive list of markers/“loci”:
Microsatellites: highly polymorphic (which is good!). These are usually what is used in “regular” genetic mark-recapture, but sample sizes likely not high enough for many kin finding tasks (esp half-sibs). qPCR requires known sequences!
SNPs, i.e., single nucleotide polymorphisms. Can be identified with next generation sequencing (NGS), which can ID thousands of loci! Rapid increase in CKMR studies/literature likely attributable to NGS technology.
I’ll focus on SNPs via NGS which is what CSIRO uses for fisheries assessments and we used for bearded seals
CSIRO uses Diversity Arrays Technology (DArT), a company with academic ties out of Australia.
Basic workflow:
Prep samples
Use an initial set of DNA samples (100s of individuals) with DArTseq technology to identify candidate SNP markers (often in the 10s of 1000s)
Identify those markers that will be the most useful (appear to “behave” correctly, include sufficient allele diversity)
Run DArTcap on all tissue samples with reduced set of markers with baits developed from DArTseq
Further QA/QC
Summarize genotypes at surviving loci
DArTseq data are actually count data! Counts of # of alleles at different loci made in replicate sequencings. These need to be converted to genotypes…
4-way genotyping setup: \(AA0\), \(BB0\), \(AB\), \(00\)
6-way genotyping setup: \(AA\), \(A0\), \(AB\), \(BB\), \(B0\), \(00\)
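One way to read the two setups, stated here as an assumption rather than as the documented definition: "0" denotes a null allele, and the 4-way scheme pools genotypes that show the same single observed allele (so \(AA0\) covers both \(AA\) and \(A0\)). Under that reading, collapsing 6-way calls to 4-way calls is a simple lookup; the names below are hypothetical.

```python
# Assumed reading: "0" is a null allele, and the 4-way scheme pools
# genotypes in which only one allele type is observed.
SIX_TO_FOUR = {
    "AA": "AA0",
    "A0": "AA0",
    "AB": "AB",
    "BB": "BB0",
    "B0": "BB0",
    "00": "00",
}


def collapse_to_four_way(genotype: str) -> str:
    """Map a 6-way genotype call to its 4-way category."""
    return SIX_TO_FOUR[genotype]


print([collapse_to_four_way(g) for g in ["AA", "A0", "AB", "BB", "B0", "00"]])
```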
Bearded seal study: